<html>
<head><meta charset="utf-8"><title>aarch64 self-hosted agents crashing · t-infra · Zulip Chat Archive</title></head>
<h2>Stream: <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/index.html">t-infra</a></h2>
<h3>Topic: <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html">aarch64 self-hosted agents crashing</a></h3>

<hr>

<base href="https://rust-lang.zulipchat.com">

<head><link href="https://rust-lang.github.io/zulip_archive/style.css" rel="stylesheet"></head>

<a name="206396316"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396316" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396316">(Aug 09 2020 at 14:03)</a>:</h4>
<p>gah I'm dumb</p>



<a name="206396364"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396364" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396364">(Aug 09 2020 at 14:04)</a>:</h4>
<p>a while ago I noticed the self-hosted GHA agents were being killed mid-build, like the macOS builders</p>



<a name="206396365"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396365" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396365">(Aug 09 2020 at 14:04)</a>:</h4>
<p>for example <a href="https://github.com/rust-lang-ci/rust/runs/963328848">https://github.com/rust-lang-ci/rust/runs/963328848</a></p>



<a name="206396371"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396371" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396371">(Aug 09 2020 at 14:04)</a>:</h4>
<p>so I was starting to worry that we were having a similar problem to what GH was having for the hosted macOS builders</p>



<a name="206396434"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396434" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396434">(Aug 09 2020 at 14:06)</a>:</h4>
<p>well, it turns out that if I configure the hypervisor to restart the VM every three hours it will restart the VM every three hours</p>



<a name="206396436"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396436" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396436">(Aug 09 2020 at 14:06)</a>:</h4>
<p>even if a build is running</p>



<a name="206396438"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396438" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396438">(Aug 09 2020 at 14:06)</a>:</h4>
<p>who would've thought <span aria-label="face palm" class="emoji emoji-1f926" role="img" title="face palm">:face_palm:</span></p>



<a name="206396775"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396775" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396775">(Aug 09 2020 at 14:15)</a>:</h4>
<p>I'm wondering what the best approach to fix this is</p>



<a name="206396789"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396789" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396789">(Aug 09 2020 at 14:15)</a>:</h4>
<p>the reason we have the three hour timeout at the hypervisor level is to prevent a compromised VM from running forever</p>



<a name="206396841"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396841" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396841">(Aug 09 2020 at 14:16)</a>:</h4>
<p>but the hypervisor doesn't know when a build starts in the current implementation, so it can't start the timeout at the right time</p>



<a name="206396858"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396858" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396858">(Aug 09 2020 at 14:17)</a>:</h4>
<p>hm it seems like we shouldn't worry too much about that in an automated way</p>



<a name="206396863"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396863" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396863">(Aug 09 2020 at 14:17)</a>:</h4>
<p>we'll get notified by bors failing anyway, right?</p>



<a name="206396865"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396865" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396865">(Aug 09 2020 at 14:17)</a>:</h4>
<p>nope</p>



<a name="206396903"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396903" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396903">(Aug 09 2020 at 14:18)</a>:</h4>
<p>hm why not, wouldn't the build time out?</p>



<a name="206396905"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396905" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Alex Gaynor <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396905">(Aug 09 2020 at 14:18)</a>:</h4>
<p>Does the github agent provide a way of triggering some action on the host after a build?</p>



<a name="206396923"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396923" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396923">(Aug 09 2020 at 14:19)</a>:</h4>
<p><span class="user-mention" data-user-id="116122">@simulacrum</span>  I mean if you have control of the VM you can do whatever you want, including stopping the agent and faking the "the build passed" API call to github</p>



<a name="206396931"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396931" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396931">(Aug 09 2020 at 14:19)</a>:</h4>
<p><span class="user-mention" data-user-id="130046">@Alex Gaynor</span> yes, but for this we can't trust the agent -- all the ephemeral VMs stuff was done to work around the security issues with self-hosted runners on public repositories</p>



<a name="206396934"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396934" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396934">(Aug 09 2020 at 14:19)</a>:</h4>
<p>hm sure, I guess</p>



<a name="206396935"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396935" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396935">(Aug 09 2020 at 14:19)</a>:</h4>
<p>but that seems like a separate issue?</p>



<a name="206396975"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396975" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396975">(Aug 09 2020 at 14:20)</a>:</h4>
<p>like, if someone does that, then killing the VM every 3 hours won't help</p>



<a name="206396977"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396977" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396977">(Aug 09 2020 at 14:20)</a>:</h4>
<p>(we're already shutting down the VM cleanly after a build if no compromise happens)</p>



<a name="206396988"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396988" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396988">(Aug 09 2020 at 14:20)</a>:</h4>
<p>well it limits the attack by preventing it from persisting</p>



<a name="206396992"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396992" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396992">(Aug 09 2020 at 14:20)</a>:</h4>
<p>hm okay, but I'm not sure I follow</p>



<a name="206396994"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396994" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396994">(Aug 09 2020 at 14:20)</a>:</h4>
<p>like, you say we're killing "if no compromise happens"</p>



<a name="206396997"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396997" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396997">(Aug 09 2020 at 14:21)</a>:</h4>
<p>how would a compromise of the VM stop that kill from occurring?</p>



<a name="206396999"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206396999" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206396999">(Aug 09 2020 at 14:21)</a>:</h4>
<p>(presuming they don't have VM escape)</p>



<a name="206397002"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397002" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Alex Gaynor <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397002">(Aug 09 2020 at 14:21)</a>:</h4>
<p>Assuming builds take &lt;3 hours, when is the "restart every 3 hours" being hit? Is that "every 3 hours" just noon, 3pm, 6pm, etc. regardless of when the VM started?</p>
<p>If yes, maybe the fix is "restart the VM after it's been alive 3 hours"</p>



<a name="206397069"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397069" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397069">(Aug 09 2020 at 14:23)</a>:</h4>
<p>the script at startup is basically:</p>
<div class="codehilite"><pre><span></span><code># run the agent for a single job, then let it exit
start-gha-runner --once
# power off the VM once the agent has exited
poweroff
</code></pre></div>


<p>so if no attack happens the build runs, the agent shuts down as it's only supposed to do a single run (<code>--once</code>), and then we call <code>poweroff</code></p>



<a name="206397115"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397115" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397115">(Aug 09 2020 at 14:24)</a>:</h4>
<p>can we wrap the 'qemu-vm' or however we're bringing it up in a timeout?</p>



<a name="206397119"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397119" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397119">(Aug 09 2020 at 14:24)</a>:</h4>
<p>or pass in some token that has to be unique (i.e., only usable once)</p>



<a name="206397120"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397120" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397120">(Aug 09 2020 at 14:24)</a>:</h4>
<p>if an attacker gains control of the VM they could send a manual "the build finished" API call to github, kill that script (so the <code>poweroff</code> is never executed), and start another agent without <code>--once</code>: then, all future builds on that agent will run on the compromised code</p>



<a name="206397138"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397138" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397138">(Aug 09 2020 at 14:25)</a>:</h4>
<blockquote>
<p>or pass in some token that has to be unique (i.e., only usable once)</p>
</blockquote>
<p>unfortunately github's agent auth doesn't support that as far as I'm aware</p>



<a name="206397183"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397183" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397183">(Aug 09 2020 at 14:26)</a>:</h4>
<blockquote>
<p>If yes, maybe the fix is "restart the VM after it's been alive 3 hours"</p>
<p>can we wrap the 'qemu-vm' or however we're bringing it up in a timeout?</p>
</blockquote>
<p>that's what we're doing right now</p>



<a name="206397195"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397195" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397195">(Aug 09 2020 at 14:26)</a>:</h4>
<p>but github doesn't have a webhook "hey I need a new VM"</p>



<a name="206397202"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397202" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397202">(Aug 09 2020 at 14:26)</a>:</h4>
<p>hm I'm confused then</p>



<a name="206397203"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397203" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397203">(Aug 09 2020 at 14:26)</a>:</h4>
<p>so the only thing we can do is to start the VM ahead of time and wait for a job to be assigned to it</p>



<a name="206397205"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397205" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397205">(Aug 09 2020 at 14:26)</a>:</h4>
<p>ah</p>



<a name="206397209"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397209" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397209">(Aug 09 2020 at 14:27)</a>:</h4>
<p>which is why the "kill after 3 hours" breaks things</p>



<a name="206397211"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397211" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397211">(Aug 09 2020 at 14:27)</a>:</h4>
<p>so the problem happens if we have no build for X time and then the build starts and doesn't finish in the 3 hour window?</p>



<a name="206397212"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397212" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397212">(Aug 09 2020 at 14:27)</a>:</h4>
<p>yep</p>
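<p>A minimal sketch of that failure condition, with made-up numbers: the VM is booted ahead of time, sits idle until a job is assigned, and the hypervisor's three hour clock has been running since boot. The function name <code>build_survives</code> and the minute values are illustrative assumptions, not part of the real setup.</p>
<div class="codehilite"><pre><span></span><code>
```shell
#!/usr/bin/env bash
# Hypothetical helper: does a build finish before a fixed VM lifetime limit?
# All arguments are in minutes. The limit counts from VM boot, not from the
# moment a job is assigned to the runner.
build_survives() {
    local idle_minutes=$1 build_minutes=$2 limit_minutes=${3:-180}
    # The job ends this many minutes after the VM was booted.
    local finish_minute=$(( idle_minutes + build_minutes ))
    # Succeeds (exit status 0) only if the job ends within the limit.
    [ "$finish_minute" -le "$limit_minutes" ]
}
```
</code></pre></div>
<p>For example, <code>build_survives 0 150</code> succeeds, while <code>build_survives 60 150</code> fails because the job would end 210 minutes after boot, past the 180 minute limit.</p>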



<a name="206397215"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397215" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397215">(Aug 09 2020 at 14:27)</a>:</h4>
<p>sorry should've explained better <span aria-label="sweat smile" class="emoji emoji-1f605" role="img" title="sweat smile">:sweat_smile:</span></p>



<a name="206397218"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397218" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397218">(Aug 09 2020 at 14:27)</a>:</h4>
<p>how does github ping us in the first place? HTTP hit?</p>



<a name="206397220"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397220" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397220">(Aug 09 2020 at 14:28)</a>:</h4>
<p>the API the agent uses is private and undocumented</p>



<a name="206397259"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397259" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397259">(Aug 09 2020 at 14:28)</a>:</h4>
<p>can we have a mini rust server or something that catches that, starts the vm, then passes the request into the VM?</p>



<a name="206397261"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397261" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397261">(Aug 09 2020 at 14:28)</a>:</h4>
<p>I guess it's some form of long polling from the agent</p>



<a name="206397262"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397262" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397262">(Aug 09 2020 at 14:28)</a>:</h4>
<p>(CGI)</p>



<a name="206397266"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397266" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397266">(Aug 09 2020 at 14:28)</a>:</h4>
<p>ah okay, long polling wouldn't work :/</p>



<a name="206397270"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397270" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397270">(Aug 09 2020 at 14:28)</a>:</h4>
<p>we could, but I'm not too keen to reverse engineer the whole API</p>



<a name="206397275"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397275" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397275">(Aug 09 2020 at 14:29)</a>:</h4>
<p>can we catch the start of the build somehow? CPU usage spike? :)</p>



<a name="206397280"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397280" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Joshua Nelson <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397280">(Aug 09 2020 at 14:29)</a>:</h4>
<blockquote>
<p>but github doesn't have a webhook "hey I need a new VM"</p>
</blockquote>
<p>Can we ask them to add one?</p>



<a name="206397281"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397281" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397281">(Aug 09 2020 at 14:29)</a>:</h4>
<p>I guess, alternatively</p>



<a name="206397286"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397286" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397286">(Aug 09 2020 at 14:29)</a>:</h4>
<p>we remove the timeout, and modify bors or w/e to have a webhook "build finished" and we use that to kill it</p>



<a name="206397287"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397287" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397287">(Aug 09 2020 at 14:29)</a>:</h4>
<p>another alternative that comes to mind is querying the self-hosted agents API every minute</p>



<a name="206397292"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397292" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397292">(Aug 09 2020 at 14:30)</a>:</h4>
<p>and start the timeout when the status moves from <code>idle</code> to <code>active</code></p>
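<p>A rough sketch of that polling idea, not the code that was actually deployed: the timeout clock for a runner starts at the first poll that sees it leave <code>idle</code>, instead of when the VM boots. <code>TimeoutWatcher</code>, <code>fetch_statuses</code> and <code>kill_vm</code> are made-up names for illustration, and the <code>idle</code>/<code>active</code> strings are assumed to be whatever the agents API reports.</p>

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TimeoutWatcher:
    """Start a per-runner clock when it turns active; flag runners past the timeout."""

    timeout_secs: float
    started_at: Dict[str, float] = field(default_factory=dict)

    def observe(self, statuses: Dict[str, str], now: float) -> List[str]:
        """Take one poll result (runner name -> status) and return runners to kill."""
        expired = []
        for name, status in statuses.items():
            if status == "active":
                # The first poll that sees the runner active starts its clock.
                start = self.started_at.setdefault(name, now)
                if now - start >= self.timeout_secs:
                    expired.append(name)
            else:
                # Not building anything, so there is nothing to time out.
                self.started_at.pop(name, None)
        # Forget runners that no longer show up in the poll at all.
        for name in list(self.started_at):
            if name not in statuses:
                del self.started_at[name]
        return expired


def poll_forever(
    fetch_statuses: Callable[[], Dict[str, str]],  # hypothetical: wraps the agents API
    kill_vm: Callable[[str], None],                # hypothetical: tears down the VM
    watcher: TimeoutWatcher,
    interval_secs: float = 60.0,                   # "every minute"
) -> None:
    while True:
        for name in watcher.observe(fetch_statuses(), time.monotonic()):
            kill_vm(name)
        time.sleep(interval_secs)
```

<p>With one poll per minute the clock can start up to a minute late, so the effective timeout is a bit longer than the configured one.</p>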



<a name="206397341"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397341" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Joshua Nelson <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397341">(Aug 09 2020 at 14:30)</a>:</h4>
<p>I'm still a little confused by the threat model</p>



<a name="206397344"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397344" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Joshua Nelson <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397344">(Aug 09 2020 at 14:30)</a>:</h4>
<p>We're worried about a malicious self-hosted runner?</p>



<a name="206397346"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397346" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397346">(Aug 09 2020 at 14:30)</a>:</h4>
<p>so</p>



<a name="206397350"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397350" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397350">(Aug 09 2020 at 14:31)</a>:</h4>
<p>self-hosted runners for public repositories are broken</p>



<a name="206397352"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397352" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Alex Gaynor <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397352">(Aug 09 2020 at 14:31)</a>:</h4>
<p>The threat is someone submits something to CI that's malicious.</p>



<a name="206397353"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397353" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Joshua Nelson <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397353">(Aug 09 2020 at 14:31)</a>:</h4>
<p>So why should we trust it to tell us the build started at the right time?</p>



<a name="206397357"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397357" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397357">(Aug 09 2020 at 14:31)</a>:</h4>
<p>because anyone can open a PR that changes the CI config to do whatever</p>



<a name="206397358"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397358" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397358">(Aug 09 2020 at 14:31)</a>:</h4>
<p>at least it's documented lol</p>



<a name="206397399"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397399" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397399">(Aug 09 2020 at 14:32)</a>:</h4>
<p><a href="https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#self-hosted-runner-security-with-public-repositories">https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#self-hosted-runner-security-with-public-repositories</a></p>



<a name="206397409"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397409" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397409">(Aug 09 2020 at 14:33)</a>:</h4>
<p>we currently have two layers of defense:</p>
<ul>
<li>we run a custom fork of the runner that rejects any PR build</li>
<li>we run the builds in ephemeral VMs, so even if someone bypasses the protections in the custom runner they can't do persistent damage</li>
</ul>
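<p>To illustrate the first layer only: a guess at what "rejects any PR build" could look like as a check, not how the actual runner fork implements it. <code>should_run</code> is a hypothetical helper; the event names and the <code>refs/pull/</code> prefix are the ones GitHub uses for pull request workflows.</p>

```python
# Events GitHub Actions fires for pull requests.
PR_EVENTS = {"pull_request", "pull_request_target"}


def should_run(event_name: str, ref: str) -> bool:
    """Return False for any job that originates from a pull request."""
    if event_name in PR_EVENTS:
        return False
    # PR merge/head refs look like refs/pull/<number>/merge.
    if ref.startswith("refs/pull/"):
        return False
    return True
```

<p>The second layer (ephemeral VMs) is what limits the damage if a check like this gets bypassed.</p>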



<a name="206397412"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397412" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Joshua Nelson <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397412">(Aug 09 2020 at 14:33)</a>:</h4>
<p>Ok I see, the threat is that arbitrary commands run <em>during</em> the build, not before it starts</p>



<a name="206397413"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397413" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397413">(Aug 09 2020 at 14:33)</a>:</h4>
<p>yep</p>



<a name="206397963"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397963" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397963">(Aug 09 2020 at 14:47)</a>:</h4>
<p>pushed a couple commits to improve logging, will check tomorrow if the logs confirm my hypothesis</p>



<a name="206397968"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/aarch64%20self-hosted%20agents%20crashing/near/206397968" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/aarch64.20self-hosted.20agents.20crashing.html#206397968">(Aug 09 2020 at 14:47)</a>:</h4>
<p>in that case I'll implement the check-the-api-periodically solution</p>



<hr><p>Last updated: Aug 07 2021 at 22:04 UTC</p>
</html>